This is a self-contained data analysis report.

The main hypothesis of this study were evaluated in consideration of daytime pumping station and artificial habitat use, which were

  1. Roach will occupy pumping station when artificial habitat is absent

  2. When artificial habitat is introduced (pre-exclusion), roach will prefer the pumping station

  3. When excluded from the pumping station, artificial habitat occupancy will increase

  4. Roach will show a preferential change (compared to pre-exclusion) to occupying artificial habitat when the pumping station is reintroduced (post-exclusion)

  5. Artificial habitat occupancy will be higher in sheltered (vs unsheltered) treatments

Required libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(funModeling)
library(dlookr)
library(shiny)
library(glmmTMB)
library(ggeffects)
library(DHARMa)
library(ggpubr)

1. Load & prepare data

roach_wide.csv - wide data set where fish counts are stored in columns c_hab, c_open, c_ps

roach_wide=read_csv("./data/roach_wide.csv")

1.1 Data discovery

Determine dataset structure by checking number of rows, columns and data types (i.e., numerical, factors). Identify categorical factors and their levels.

nrow(roach_wide)
## [1] 6942
ncol(roach_wide)
## [1] 17
colnames(roach_wide)
##  [1] "date"          "time"          "trial"         "treatment"    
##  [5] "tank"          "light"         "sequence"      "day"          
##  [9] "hours_havail"  "hours_lout"    "c_hab"         "c_open_screen"
## [13] "c_open_cent"   "c_open_hab"    "c_open"        "c_ps"         
## [17] "c_both"

funModeling:df_status - For each variable it returns: Quantity and percentage of zeros (q_zeros and p_zeros respectively). Same metrics for NA values (q_NA/p_na), and infinite values (q_inf/p_inf). Last two columns indicates data type and quantity of unique values.

roach_wide_status<- funModeling::df_status(roach_wide, print_results = F)
variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
date 0 0.00 0 0.00 0 0 character 52
time 288 4.15 0 0.00 0 0 hms-difftime 24
trial 0 0.00 0 0.00 0 0 numeric 24
treatment 3471 50.00 0 0.00 0 0 numeric 2
tank 0 0.00 0 0.00 0 0 numeric 6
light 2616 37.68 0 0.00 0 0 numeric 2
sequence 0 0.00 0 0.00 0 0 numeric 4
day 0 0.00 0 0.00 0 0 numeric 13
hours_havail 0 0.00 3516 50.65 0 0 numeric 72
hours_lout 288 4.15 2334 33.62 0 0 numeric 16
c_hab 3201 46.11 0 0.00 0 0 numeric 13
c_open_screen 5683 81.86 0 0.00 0 0 numeric 13
c_open_cent 5331 76.79 0 0.00 0 0 numeric 13
c_open_hab 4510 64.97 0 0.00 0 0 numeric 13
c_open 3732 53.76 0 0.00 0 0 numeric 13
c_ps 3724 53.64 0 0.00 0 0 numeric 13
c_both 599 8.63 0 0.00 0 0 numeric 13

Several variables need to be converted to factors and labels added to levels.

1.1.1 Set factors and add labels

Create a lookup table for variable labels.

labels_table <- list(
  treatment = c('Covered (B)','Uncovered (A)'),
  light = c('Day', 'Night'),
  sequence = c('Baseline', 'I 1', 'I 2', 'I 3')
)

Convert variables to factors with specific labeling using a loop. Use the lookup table to get the labels for each variable.

for (var in names(labels_table)) {
  levels <- unique(roach_wide[[var]])
  labels <- labels_table[[var]]
  roach_wide[[var]] <- factor(roach_wide[[var]], levels = levels, labels = labels)
}

Convert other variables to factors without adding labels.

other_vars <- c("hours_havail", "hours_lout", "day", "trial")
roach_wide[other_vars] <- lapply(roach_wide[other_vars], as.factor)

1.2 Data diagnosis

1.2.1 NAs

Using the roach_wide_status dataframe consider variables with NA values.

roach_na <- roach_wide_status %>%
  filter(q_na > 0) %>%
  arrange(-p_na) %>%
  select(variable, q_na, p_na)
variable q_na p_na
hours_havail 3516 50.65
hours_lout 2334 33.62

Variables with NA not considered in main analysis, NAs can be omitted for visual analysis.

1.2.2 0’s

Using the roach_wide_status dataframe consider the spread of zeros in habitat count variables

roach_0 <- roach_wide_status %>%
  filter(variable %in% c("c_hab", "c_ps", "c_open", "c_open_cent", "c_open_screen", "c_open_hab")) %>%
  arrange(-p_zeros) %>%
  select(variable, q_zeros, p_zeros)
variable q_zeros p_zeros
c_open_screen 5683 81.86
c_open_cent 5331 76.79
c_open_hab 4510 64.97
c_open 3732 53.76
c_ps 3724 53.64
c_hab 3201 46.11

High proportion of 0’s in all counts, but will be confounded without consideration for light period and habitat availability. Regardless, the presence of high 0’s will significantly skew variability and make raw counts hard to interpret. Descriptive statistics would be limited by high IQR and misleading min/max values. Consider rescaling count variables for analysis.

1.2.3 Outliers

Check the main dataframe for outliers.

dlookr::plot_outlier - for each variable specified the function plots outlier information for numerical data diagnosis

dlookr::plot_outlier(roach_wide, "c_hab", "c_ps", "c_open")

1.2.4 Normality

Check the main dataframe for data distribution.

dlookr::plot_normality - for each variable specified the function determines normality by plotting histogram and q-q plot of the original data, log transformed, and square root transformed.

dlookr::plot_normality(roach_wide, "c_hab", "c_ps", "c_open")

1.3 Rescale count data

The code performs a two-step transformation: it first normalizes raw fish count data to a 0-1 scale, and then ensures that the normalized values have a small positive offset to prevent division by zero issues for modelling.

roach_wide <- roach_wide %>%
  mutate(across(c(c_hab, c_ps, c_open), ~ . / max(.), .names = "{.col}_normalized"),
         across(ends_with("_normalized"), ~ ifelse(. == 0, . + 0.0000000001, .)))

1.4 Summarise

First, create minimal dataset with variables of interest.

roach_wide_sum = select(roach_wide, c_hab, c_ps, c_open, c_hab_normalized, c_ps_normalized, c_open_normalized, sequence, light, treatment)

funModeling::profile_num - Provides an expanded summary including mean, standard deviation, skewness and kurtosis.

Skewness >0 = right-skew, <0 = left-skew, 0 = symmetric. kurtosis >3 = sharp, <3 = flat, 3 = normal

numeric_prof <-funModeling::profiling_num(roach_wide_sum)
variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98 range_80
c_hab 4.1780467 4.7238987 1.130648 0 0 0 2.0000000 8.0000000 12 12 0.6180048 1.782165 8.0000000 [0, 12] [0, 12]
c_ps 4.3388073 5.4305629 1.251626 0 0 0 0.0000000 12.0000000 12 12 0.6121211 1.476341 12.0000000 [0, 12] [0, 12]
c_open 3.4462691 4.3204959 1.253673 0 0 0 0.0000000 7.0000000 12 12 0.8004510 2.112379 7.0000000 [0, 12] [0, 11]
c_hab_normalized 0.3481706 0.3936582 1.130648 0 0 0 0.1666667 0.6666667 1 1 0.6180048 1.782165 0.6666667 [1e-10, 1] [1e-10, 1]
c_ps_normalized 0.3615673 0.4525469 1.251626 0 0 0 0.0000000 1.0000000 1 1 0.6121211 1.476341 1.0000000 [1e-10, 1] [1e-10, 1]
c_open_normalized 0.2871891 0.3600413 1.253673 0 0 0 0.0000000 0.5833333 1 1 0.8004510 2.112379 0.5833333 [1e-10, 1] [1e-10, 0.916666666666667]

Standard deviations of raw counts are high, as expected from normality plots.

Examine raw count descriptive summary before considering rescaled values.

count_sum <- bind_rows(
  roach_wide_sum %>%
    group_by(light) %>%
    summarise(sum_c_hab = sum(c_hab),
              sum_c_ps = sum(c_ps),
              sum_c_open = sum(c_open)) %>%
    rename(variable = light),
  roach_wide_sum %>%
    group_by(sequence) %>%
    summarise(sum_c_hab = sum(c_hab),
              sum_c_ps = sum(c_ps),
              sum_c_open = sum(c_open)) %>%
    rename(variable = sequence),
  roach_wide_sum %>%
    summarise(sum_c_hab = sum(c_hab),
              sum_c_ps = sum(c_ps),
              sum_c_open = sum(c_open),
              variable = "Total") %>%
    mutate(variable = factor(variable)))
variable sum_c_hab sum_c_ps sum_c_open
Light period
Day 15011 13426 2790
Night 13993 16694 21134
Experimental sequence
Baseline 0 15014 6283
I 1 4483 13129 3041
I 2 13272 0 7327
I 3 11249 1977 7273
Total count
Total 29004 30120 23924

Examine rescaled counts

rescaled_mean <- bind_rows(
  roach_wide_sum %>%
    group_by(light) %>%
    summarise(mean_c_hab = mean(c_hab_normalized),
              mean_c_ps = mean(c_ps_normalized),
              mean_c_open = mean(c_open_normalized)) %>%
    rename(variable = light),
  roach_wide_sum %>%
    group_by(sequence) %>%
    summarise(mean_c_hab = mean(c_hab_normalized),
              mean_c_ps = mean(c_ps_normalized),
              mean_c_open = mean(c_open_normalized)) %>%
    rename(variable = sequence),
    roach_wide_sum %>% filter(light=="Day")%>%
    group_by(treatment) %>%
    summarise(mean_c_hab = mean(c_hab_normalized),
              mean_c_ps = mean(c_ps_normalized),
              mean_c_open = mean(c_open_normalized)) %>%
    rename(variable = treatment))
variable mean_c_hab mean_c_ps mean_c_open
Light period
Day 0.4781792 0.4276886 0.0888761
Night 0.2695523 0.3215827 0.4071120
Experimental sequence
Baseline 0.0000000 0.7044857 0.2948104
I 1 0.2161941 0.6331501 0.1466532
I 2 0.6400463 0.0000000 0.3533468
I 3 0.5481969 0.0963450 0.3544347
Treatment (daytime)
Covered (B) 0.4968782 0.4083206 0.0899592
Uncovered (A) 0.4594801 0.4470566 0.0877931

2 Visualise main relationships

Apply custom themeing for ggplot2 objects.

theme_JN <- function(base_size=12){ 
  theme_grey() %+replace%
    theme(
      axis.text = element_text(colour="black"),
      axis.title = element_text(colour="black"),
      axis.ticks = element_line(colour="black"),
      panel.border = element_rect(colour = "black", fill=NA),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_blank(),
      strip.background = element_rect(colour = "black",fill = NA),
      panel.spacing.x = unit(12, "pt")
    ) 
}

Plot raw count data first to confirm suitability of rescaling data.

ggplot(roach_wide_sum, aes(x = sequence, y= c_ps)) +
  geom_boxplot() +theme_JN()+theme(text=element_text(size=20))

ggplot(roach_wide_sum, aes(x = sequence, y= c_hab)) +
  geom_boxplot()+theme_JN()+theme(text=element_text(size=20))

ggplot(roach_wide_sum, aes(x = sequence, y= c_open)) +
  geom_boxplot()+theme_JN()+theme(text=element_text(size=20))

The boxplots confirm raw count data are unsuitable for analysis due to large variation in counts presenting large IQR + outliers. Descriptive data e.g., medians are therefore hard to interpret and generalise.

From here on all analysis considers rescaled counts.

#Artificial habitat,light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
         group_by(light) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = light, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#pumping station, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
         group_by(light) %>%
         summarise(mean_c_hab = mean(c_ps_normalized),
                   se_c_hab = sd(c_ps_normalized) / sqrt(n())),
       aes(x = light, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#open water, light
ggplot(roach_wide_sum %>%
         group_by(light) %>%
         summarise(mean_c_hab = mean(c_open_normalized),
                   se_c_hab = sd(c_open_normalized) / sqrt(n())),
       aes(x = light, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#Artificial habitat, sequence, light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
         group_by(sequence, light) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#pumping station, sequence, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
         group_by(sequence, light) %>%
         summarise(mean_c_hab = mean(c_ps_normalized),
                   se_c_hab = sd(c_ps_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#open water, sequence, light
ggplot(roach_wide_sum %>%
         group_by(sequence, light) %>%
         summarise(mean_c_hab = mean(c_open_normalized),
                   se_c_hab = sd(c_open_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

By plotting the main visual relationships we’re able to quickly identify relationships in the data and begin addressing study objectives and hypothesis.

Here, we can consider that the day/night relationship in habitat occupancy was relatively fixed throughout the experiment in all habitat options. We see that open water occupancy was decreased in intervention 1, possibly attributed to introducing artificial habitat.

  1. H1 was supported, fish occupied the pumping station during the day
  2. H2 was partially supported, there was no equal preference but high variation in fish behavior
  3. H3 supported, habitat occupancy increased in artificial habitat when excluded from pumping station
  4. H4 supported, habitat occupancy was highest in artificial habitat post-exclusion. Preferential change demonstrated

To consider H5, let’s examine treatment effects.

#Artificial habitat, sequence, treatment
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
         group_by(sequence, treatment) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = treatment)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#pumping station, sequence, treatment
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
         group_by(sequence, treatment) %>%
         summarise(mean_c_hab = mean(c_ps_normalized),
                   se_c_hab = sd(c_ps_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = treatment)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#open water, sequence, treatment
ggplot(roach_wide_sum %>%
         group_by(sequence, treatment) %>%
         summarise(mean_c_hab = mean(c_open_normalized),
                   se_c_hab = sd(c_open_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = treatment)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#Artificial habitat, treatment, light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
         group_by(treatment, light) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = treatment, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#daytime counts highest in covered treatments

#pumping station, treatment, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
         group_by(treatment, light) %>%
         summarise(mean_c_hab = mean(c_ps_normalized),
                   se_c_hab = sd(c_ps_normalized) / sqrt(n())),
       aes(x = treatment, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#open water, treatment, light
ggplot(roach_wide_sum %>%
         group_by(treatment, light) %>%
         summarise(mean_c_hab = mean(c_open_normalized),
                   se_c_hab = sd(c_open_normalized) / sqrt(n())),
       aes(x = treatment, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#Artificial habitat, treatment (daytime)
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
         group_by(treatment, light) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = treatment, y = mean_c_hab, colour = light)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

#Artificial habitat, sequence, treatment (daytime)
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
         group_by(sequence, treatment) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = sequence, y = mean_c_hab, colour = treatment)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

In these plots we see that treatment has an impact on habitat occupancy and in all sequences, habitat occupancy in artificial habitat was highest in uncovered treatments.Interestingly, occupancy in pumping station shows the opposite relationship to artificial habitat. Pumping station occupancy was highest in uncovered treatments, suggesting that cover is important i.e., when uncovered habitat is provided, more fish occupy the pumping station. Overall, we found support for H5.

Before continuing with statistical analysis let’s examine repeated measures and temporal effects.

#Artificial habitat, trial (daytime)
ggplot(roach_wide %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
         group_by(trial) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = trial, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

The variation between trials suggests significant effect of within-subject observations (i.e., repeated measures). Additionally, there appears to be temporal influence (ex. trial 1 compared to trial 24)

ggplot(roach_wide %>% filter(sequence!="Baseline")%>%
         group_by(time) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = time, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

Evidence for hour-to-hour variation. Ensure is included in analysis.

Finally, considering further relationships not required for main objectives.

ggplot(roach_wide %>%
         filter(!is.na(hours_havail))%>% filter(sequence =="I 1")%>%
         group_by(hours_havail) %>%
         summarise(mean_c_hab = mean(c_hab_normalized),
                   se_c_hab = sd(c_hab_normalized) / sqrt(n())),
       aes(x = hours_havail, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

roach_wide %>%
  filter(!is.na(hours_havail)) %>%
  filter(sequence == "I 1") %>%
  mutate(hours = as.numeric(hours_havail)) %>%
  with(cor.test(hours, c_hab_normalized))
## 
##  Pearson's product-moment correlation
## 
## data:  hours and c_hab_normalized
## t = 4.8211, df = 1702, p-value = 1.555e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06896398 0.16266011
## sample estimates:
##       cor 
## 0.1160703

There is no real relationship. Occupancy peaks 24h after introduction, as expected due to nocturnal behavior. Correlation testing shows a weak positive correlation, but significant.

ggplot(roach_wide %>%
         filter(!is.na(hours_lout))%>%
         group_by(hours_lout) %>%
         summarise(mean_c_hab = mean(c_open_normalized),
                   se_c_hab = sd(c_open_normalized) / sqrt(n())),
       aes(x = hours_lout, y = mean_c_hab)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
                width = 0.2)+theme_JN()+theme(text=element_text(size=20))

roach_wide %>%
  filter(!is.na(hours_lout)) %>%
  mutate(hours = as.numeric(hours_lout)) %>%
  with(cor.test(hours, c_open_normalized))
## 
##  Pearson's product-moment correlation
## 
## data:  hours and c_open_normalized
## t = 9.4592, df = 4606, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1096056 0.1662548
## sample estimates:
##       cor 
## 0.1380431

Again, no strong relationship. Occupancy peaks within 1h of lights out. Correlation testing shows a weak positive correlation, but significant.

Ordinarily we may now choose to perform group comparisons etc on our variables of interest to support the relationships shown in the visual analysis. However, the nature of the problem is complex and thus is best treated by a modelling approach.


3 GLMM

Data exploration has shown:

  • Continuous non-negative response variable with zero values treated by adding 0.000000001
  • Left-skewed, Gamma distribution undesired
  • Gaussian distribution suitable for continuous response. Apply log-link to account for uneven variance in sequence groups.
  • Raw counts would be overdispersed, treated by rescaling.
  • No NAs
  • No outliers
  • Non-normally distributed response variable, but large sample size
  • No zeros in response
  • No continuous predictors, so no collinearity issues
  • No interactions
  • No Independence of response variables. Repeated measures design requires random effect of trial
  • Temporal dependency should be accounted for with random effects
table(roach_wide$sequence) 
## 
## Baseline      I 1      I 2      I 3 
##     1776     1728     1728     1710
table(roach_wide$treatment) 
## 
##   Covered (B) Uncovered (A) 
##          3471          3471
table(roach_wide$light)
## 
##   Day Night 
##  2616  4326

The data is well-balanced across the grouping variables except light, which is expected due to day night differences.

3.1 Data preperation

Let’s extract hour from the time variable and store as factor for random effect modelling.

roach_wide <- roach_wide %>%
  mutate(time_factor = factor(as.numeric(format(strptime(time, format = "%H:%M:%S"), "%H"))))

Create a new dataframe for binary modelling later.

#Create new DF for binary model
roach_binary <- roach_wide %>%filter (sequence=="I 1"|sequence=="I 3")%>%filter(light=="Day")
roach_binary$binary <- ifelse(roach_binary$sequence =="I 1", 0,1)

3.2 Build GLMM

3.2.1 Artifical habitat

Start by creating a null (intercept only) model.

mod1 <- glmmTMB(c_hab_normalized ~ 1, data = roach_wide%>%filter(sequence!="Baseline"),
                family = gaussian(link="log"))
summary(mod1)
##  Family: gaussian  ( log )
## Formula:          c_hab_normalized ~ 1
## Data: roach_wide %>% filter(sequence != "Baseline")
## 
##      AIC      BIC   logLik deviance df.resid 
##   4939.5   4952.6  -2467.8   4935.5     5164 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.152 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7596     0.0116  -65.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Intrepting the null model is not important here, but we can see that the estimated mean of our normalised count variable is significantly different from 0. We will use this null model for determining model fit by comparing subsequent models to the null using AIC selection.

Let’s add our fixed effects of interest.

mod1.1 <- update(mod1, c_hab_normalized ~ sequence + light + treatment)
summary(mod1.1)
##  Family: gaussian  ( log )
## Formula:          c_hab_normalized ~ sequence + light + treatment
## Data: roach_wide %>% filter(sequence != "Baseline")
## 
##      AIC      BIC   logLik deviance df.resid 
##   2422.0   2461.3  -1205.0   2410.0     5160 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.0934 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.20486    0.03538  -34.05  < 2e-16 ***
## sequenceI 2             1.12453    0.03607   31.18  < 2e-16 ***
## sequenceI 3             1.02292    0.03656   27.98  < 2e-16 ***
## lightNight             -0.62328    0.01728  -36.07  < 2e-16 ***
## treatmentUncovered (A) -0.08459    0.01619   -5.22 1.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model outputs are sensible and have reduced the AIC of the null model from 4939 to 2422. All our predictor terms are significant with sensible coefficent and error estimates. For example, we can see that the mean value of c_hab_normalized is reduced by -0.623 units during night time light conditions.

It can be useful to plot model predictions early whilst building a complete model to understand how adding independent variables affects the estimated response.

plot(ggpredict(mod1.1, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))

Here, we can see that the predicted habitat occupancy response is sensible, and close to the observed values we plotted earlier.

Importantly, random effects need to be added to account for repeated measures and temporal dependency.

mod1.2 <- update(mod1.1, . ~ . + (1 | time_factor/day/trial))
summary(mod1.2)
##  Family: gaussian  ( log )
## Formula:          
## c_hab_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide %>% filter(sequence != "Baseline")
## 
##      AIC      BIC   logLik deviance df.resid 
##   2401.6   2460.5  -1191.8   2383.6     5157 
## 
## Random effects:
## 
## Conditional model:
##  Groups                Name        Variance  Std.Dev.
##  trial:day:time_factor (Intercept) 1.001e-10 0.00001 
##  day:time_factor       (Intercept) 7.687e-03 0.08767 
##  time_factor           (Intercept) 9.906e-04 0.03147 
##  Residual                          9.123e-02 0.30204 
## Number of obs: 5166, groups:  
## trial:day:time_factor, 5166; day:time_factor, 216; time_factor, 24
## 
## Dispersion estimate for gaussian family (sigma^2): 0.0912 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.19015    0.03873 -30.731  < 2e-16 ***
## sequenceI 2             1.12143    0.03946  28.418  < 2e-16 ***
## sequenceI 3             0.98280    0.04094  24.007  < 2e-16 ***
## lightNight             -0.62157    0.02527 -24.600  < 2e-16 ***
## treatmentUncovered (A) -0.08580    0.01598  -5.369  7.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod1.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))

As before, the model predictions are improved and the within-subject variation has been handled by adding the random effect of trial.

We can check the goodness of fit by plotting residuals using DHARMA.

fittedmod1.2 <- mod1.2
simuout1 <- simulateResiduals(fittedModel = fittedmod1.2)
plot(simuout1, quantreg = T)

The residuals generally follow a linear relationship and he deviation is accepted within a generalised framework. The quantile deviations present are likely a result of dispersion due to clumped values close to 0 and 1.

The model process is now repeated for the remaining response variables; pumping station and open water.

3.2.2 Pumping station

#Null model
mod2 <- glmmTMB(c_ps_normalized ~ 1, data = roach_wide%>%filter(sequence!="I 2"),
                family = gaussian(link="log"))
summary(mod2)
##  Family: gaussian  ( log )
## Formula:          c_ps_normalized ~ 1
## Data: roach_wide %>% filter(sequence != "I 2")
## 
##      AIC      BIC   logLik deviance df.resid 
##   6784.9   6798.0  -3390.4   6780.9     5212 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.215 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.73106    0.01334  -54.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add fixed effects
mod2.1 <- update(mod2, c_ps_normalized ~ sequence + light + treatment)
summary(mod2.1)
##  Family: gaussian  ( log )
## Formula:          c_ps_normalized ~ sequence + light + treatment
## Data: roach_wide %>% filter(sequence != "I 2")
## 
##      AIC      BIC   logLik deviance df.resid 
##   4372.6   4411.9  -2180.3   4360.6     5208 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.135 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.20719    0.01776 -11.665  < 2e-16 ***
## sequenceI 1            -0.10587    0.01847  -5.733 9.88e-09 ***
## sequenceI 3            -2.01448    0.09481 -21.248  < 2e-16 ***
## lightNight             -0.29107    0.01827 -15.932  < 2e-16 ***
## treatmentUncovered (A)  0.05181    0.01827   2.836  0.00457 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add random effects
mod2.2 <- update(mod2.1, . ~ . + (1 | time_factor/day/trial))
summary(mod2.2)
##  Family: gaussian  ( log )
## Formula:          
## c_ps_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide %>% filter(sequence != "I 2")
## 
##      AIC      BIC   logLik deviance df.resid 
##   4218.1   4277.1  -2100.1   4200.1     5205 
## 
## Random effects:
## 
## Conditional model:
##  Groups                Name        Variance  Std.Dev. 
##  trial:day:time_factor (Intercept) 9.025e-02 3.004e-01
##  day:time_factor       (Intercept) 2.876e-11 5.363e-06
##  time_factor           (Intercept) 3.499e-13 5.915e-07
##  Residual                          1.061e-01 3.257e-01
## Number of obs: 5214, groups:  
## trial:day:time_factor, 5214; day:time_factor, 218; time_factor, 24
## 
## Dispersion estimate for gaussian family (sigma^2): 0.106 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.26812    0.01981 -13.536  < 2e-16 ***
## sequenceI 1            -0.09148    0.01932  -4.736 2.18e-06 ***
## sequenceI 3            -2.05188    0.08633 -23.768  < 2e-16 ***
## lightNight             -0.25369    0.01920 -13.212  < 2e-16 ***
## treatmentUncovered (A)  0.04263    0.01915   2.227    0.026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod2.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))

3.2.3 Open water

#Null model
mod3 <- glmmTMB(c_open_normalized ~ 1, data = roach_wide,
                family = gaussian(link="log"))
summary(mod3)
##  Family: gaussian  ( log )
## Formula:          c_open_normalized ~ 1
## Data: roach_wide
## 
##      AIC      BIC   logLik deviance df.resid 
##   5520.5   5534.2  -2758.3   5516.5     6940 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.13 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.24761    0.01505  -82.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add fixed effects
mod3.1 <- update(mod3, c_open_normalized ~ sequence + light + treatment)
summary(mod3.1)
##  Family: gaussian  ( log )
## Formula:          c_open_normalized ~ sequence + light + treatment
## Data: roach_wide
## 
##      AIC      BIC   logLik deviance df.resid 
##   3544.6   3592.6  -1765.3   3530.6     6935 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.0974 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.47386    0.07071  -34.99  < 2e-16 ***
## sequenceI 1            -0.54735    0.04770  -11.48  < 2e-16 ***
## sequenceI 2             0.21837    0.03055    7.15 8.86e-13 ***
## sequenceI 3             0.28354    0.02989    9.49  < 2e-16 ***
## lightNight              1.49549    0.06583   22.72  < 2e-16 ***
## treatmentUncovered (A)  0.08195    0.02220    3.69 0.000223 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add random effects
mod3.2 <- update(mod3.1, . ~ . + (1 | time_factor/day/trial))
summary(mod3.2)
##  Family: gaussian  ( log )
## Formula:          
## c_open_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide
## 
##      AIC      BIC   logLik deviance df.resid 
##   3162.7   3231.2  -1571.4   3142.7     6932 
## 
## Random effects:
## 
## Conditional model:
##  Groups                Name        Variance  Std.Dev. 
##  trial:day:time_factor (Intercept) 3.340e-01 5.780e-01
##  day:time_factor       (Intercept) 2.883e-11 5.369e-06
##  time_factor           (Intercept) 9.907e-16 3.148e-08
##  Residual                          5.889e-02 2.427e-01
## Number of obs: 6942, groups:  
## trial:day:time_factor, 6942; day:time_factor, 290; time_factor, 24
## 
## Dispersion estimate for gaussian family (sigma^2): 0.0589 
## 
## Conditional model:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.39186    0.05561  -43.01  < 2e-16 ***
## sequenceI 1            -0.65043    0.04127  -15.76  < 2e-16 ***
## sequenceI 2             0.09886    0.03416    2.89 0.003804 ** 
## sequenceI 3             0.12975    0.03364    3.86 0.000115 ***
## lightNight              1.38818    0.04682   29.65  < 2e-16 ***
## treatmentUncovered (A)  0.03996    0.02421    1.65 0.098784 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod3.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))

3.2.4 Binary probability model

To separately analyse the effect of habitat exclusion on overall probability of fish occupying a given habitat post-exclusion (sequence: I 3) let’s build a pair of GLMMs in glmmTMB with binomial distribution (link = logit). In these models, fixed effects are limited to experimental sequence (I 1, I 3) and used the same random effects specified in the main models.

mod_binary_ah <- glmmTMB(c_hab_normalized ~ as.factor(binary) + (1 | time_factor/day/trial), data = roach_binary, family = binomial())
summary(mod_binary_ah)
##  Family: binomial  ( logit )
## Formula:          
## c_hab_normalized ~ as.factor(binary) + (1 | time_factor/day/trial)
## Data: roach_binary
## 
##      AIC      BIC   logLik deviance df.resid 
##   1258.3   1284.1   -624.2   1248.3     1273 
## 
## Random effects:
## 
## Conditional model:
##  Groups                Name        Variance  Std.Dev. 
##  trial:day:time_factor (Intercept) 1.903e-09 4.363e-05
##  day:time_factor       (Intercept) 6.201e-02 2.490e-01
##  time_factor           (Intercept) 7.335e-28 2.708e-14
## Number of obs: 1278, groups:  
## trial:day:time_factor, 1278; day:time_factor, 60; time_factor, 10
## 
## Conditional model:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.1126     0.1031  -10.79   <2e-16 ***
## as.factor(binary)1   2.9091     0.1643   17.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model predicts a significant increase in the probability of fish occupying artificial habitat post-exclusion.

plot(ggpredict(mod_binary_ah, terms = c("binary")))+theme_JN()+theme(text=element_text(size=20))

mod_binary_ps <- glmmTMB(c_ps_normalized ~ as.factor(binary) + (1 | time_factor/day/trial), data = roach_binary, family = binomial())
summary(mod_binary_ps)
##  Family: binomial  ( logit )
## Formula:          
## c_ps_normalized ~ as.factor(binary) + (1 | time_factor/day/trial)
## Data: roach_binary
## 
##      AIC      BIC   logLik deviance df.resid 
##   1127.6   1153.3   -558.8   1117.6     1273 
## 
## Random effects:
## 
## Conditional model:
##  Groups                Name        Variance  Std.Dev. 
##  trial:day:time_factor (Intercept) 2.877e-09 5.364e-05
##  day:time_factor       (Intercept) 4.603e-12 2.145e-06
##  time_factor           (Intercept) 1.232e-20 1.110e-10
## Number of obs: 1278, groups:  
## trial:day:time_factor, 1278; day:time_factor, 60; time_factor, 10
## 
## Conditional model:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.05317    0.08971   11.74   <2e-16 ***
## as.factor(binary)1 -3.38536    0.16652  -20.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model predicts a significant decrease in the probability of fish occupying pumping station post-exclusion.

plot(ggpredict(mod_binary_ps, terms = c("binary")))+theme_JN()+theme(text=element_text(size=20))


3.3 Save model predictions

We will save the individual model predictions using ggpredict and then combine and restructure the dataframe using dplyr from the tidyverse.

c_hab_glmmm <-ggpredict(mod1.2, terms = c("sequence", "light"))
c_ps_glmmm <-ggpredict(mod2.2, terms = c("sequence", "light"))
c_open_glmmm <-ggpredict(mod3.2, terms = c("sequence", "light"))
c_hab_glmmm_treat <-ggpredict(mod1.2, terms = c("sequence", "treatment", "light[Day]"))

Next, a grouping variable for each habitat type is added. A second variable ‘present’ is added to indicate if data is present for the experimental sequence.

c_hab_glmmm <- c_hab_glmmm %>% mutate(habitat = "Artificial habitat", present = "T")
c_ps_glmmm <- c_ps_glmmm %>% mutate(habitat = "Pumping station", present = "T")
c_open_glmmm <- c_open_glmmm %>% mutate(habitat = "Open water", present = "T")

The dataframes are then combined and restructured for plotting. First, the rows are bound then NA values are omitted and dataframe attributes are removed (for simplicity, not necessity). Dummy rows are added for the sequence groups without data, and variables are then converted to factors with specific levels.

modeloutput_df <- bind_rows(c_hab_glmmm, c_ps_glmmm, c_open_glmmm)
modeloutput_df <- na.omit(modeloutput_df)
modeloutput_df <- as.data.frame(unclass(modeloutput_df))
modeloutput_df <- modeloutput_df %>%
  add_row(x = 'Baseline', group = 'Day', habitat = 'Artificial habitat', present = "F") %>%
  add_row(x = 'Baseline', group = 'Night', habitat = 'Artificial habitat', present = "F") %>%
  add_row(x = 'I 2', group = 'Day', habitat = 'Pumping station', present = "F") %>%
  add_row(x = 'I 2', group = 'Night', habitat = 'Pumping station', present = "F")
modeloutput_df$x <- factor(modeloutput_df$x, levels = c('Baseline', 'I 1', 'I 2', 'I 3'))
modeloutput_df$habitat <- as.factor(modeloutput_df$habitat)
modeloutput_df$habitat <- factor(modeloutput_df$habitat, levels = c('Pumping station', 'Open water', 'Artificial habitat'))
x predicted std.error conf.low conf.high group habitat present
I 1 0.3041765 0.0387274 0.2819427 0.3281637 Day Artificial habitat T
I 1 0.1633742 0.0409847 0.1507639 0.1770393 Night Artificial habitat T
I 2 0.9335923 0.0222946 0.8936759 0.9752916 Day Artificial habitat T
I 2 0.5014354 0.0227331 0.4795839 0.5242825 Night Artificial habitat T
I 3 0.8127345 0.0240701 0.7752828 0.8519954 Day Artificial habitat T
I 3 0.4365223 0.0238856 0.4165575 0.4574439 Night Artificial habitat T
Baseline 0.7648180 0.0198081 0.7356943 0.7950946 Day Pumping station T
Baseline 0.5934450 0.0191550 0.5715783 0.6161482 Night Pumping station T
I 1 0.6979559 0.0205494 0.6704035 0.7266407 Day Pumping station T
I 1 0.5415647 0.0197917 0.5209591 0.5629853 Night Pumping station T
I 3 0.0982743 0.0863840 0.0829678 0.1164046 Day Pumping station T
I 3 0.0762539 0.0867094 0.0643361 0.0903794 Night Pumping station T
Baseline 0.0914594 0.0556065 0.0820155 0.1019907 Day Open water T
Baseline 0.3665273 0.0286108 0.3465395 0.3876679 Night Open water T
I 1 0.0477254 0.0596329 0.0424610 0.0536426 Day Open water T
I 1 0.1912617 0.0373060 0.1777760 0.2057704 Night Open water T
I 2 0.1009631 0.0519537 0.0911884 0.1117855 Day Open water T
I 2 0.4046138 0.0272225 0.3835914 0.4267883 Night Open water T
I 3 0.1041301 0.0506727 0.0942852 0.1150029 Day Open water T
I 3 0.4173058 0.0265197 0.3961693 0.4395699 Night Open water T
Baseline NA NA NA NA Day Artificial habitat F
Baseline NA NA NA NA Night Artificial habitat F
I 2 NA NA NA NA Day Pumping station F
I 2 NA NA NA NA Night Pumping station F

A similar process follows for the treatment dataframe.

c_hab_glmmm_treat <- as.data.frame(unclass(c_hab_glmmm_treat))
c_hab_glmmm_treat <- c_hab_glmmm_treat %>% mutate(present = "T")
c_hab_glmmm_treat <- c_hab_glmmm_treat %>% select(-facet)
c_hab_glmmm_treat <- c_hab_glmmm_treat %>%
  add_row(x = 'Baseline', group = 'Covered (B)', present = "F") %>%
  add_row(x = 'Baseline', group = 'Uncovered (A)', present = "F")
c_hab_glmmm_treat$x <- factor(c_hab_glmmm_treat$x, levels = c('Baseline', 'I 1', 'I 2', 'I 3'))
c_hab_glmmm_treat$group <- factor(c_hab_glmmm_treat$group, levels = c('Uncovered (A)', 'Covered (B)'))
x predicted std.error conf.low conf.high group present
I 1 0.3041765 0.0387274 0.2819427 0.3281637 Covered (B) T
I 1 0.2791664 0.0388745 0.2586861 0.3012682 Uncovered (A) T
I 2 0.9335923 0.0222946 0.8936759 0.9752916 Covered (B) T
I 2 0.8568302 0.0228311 0.8193338 0.8960425 Uncovered (A) T
I 3 0.8127345 0.0240701 0.7752828 0.8519954 Covered (B) T
I 3 0.7459096 0.0245659 0.7108463 0.7827025 Uncovered (A) T
Baseline NA NA NA NA Covered (B) F
Baseline NA NA NA NA Uncovered (A) F

The binary model predictions are then saved. The grouping variable is modified to allow for easy plotting.

ah_prob <- ggpredict(mod_binary_ah, terms = c("binary"))
ps_prob <- ggpredict(mod_binary_ps, terms = c("binary"))

ps_prob$group <- 2 
ps_prob$group <- factor(ps_prob$group)

habitat_prob_df <- bind_rows(ah_prob, ps_prob)
x predicted std.error conf.low conf.high group
0 0.2473919 0.1030883 0.2117147 0.2868931 1
1 0.8577281 0.1261066 0.8248246 0.8853107 1
0 0.7413837 0.0897146 0.7062698 0.7736453 2
1 0.0884921 0.1402805 0.0686808 0.1133226 2

4 Model output figures

The final model outputs can now be plotted. The following figures are finalized versions with some edits made after exporting from R, thus the code snippet will not produce an identical figure.


ggplot(data=modeloutput_df, aes(x=x, y=predicted, fill=group))+
  geom_tile(aes(x=x, y=0.5,height = Inf, fill=present), alpha = 0.3,  show.legend = FALSE) + 
  scale_fill_manual(values = c("T" = "grey80", "F" = "grey90"), guide = FALSE) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), size=0.5,width = 0.6 ,position = position_dodge(width = 0.7),  show.legend = FALSE) +
  #geom_path(aes(group = interaction(habitat, group), linetype = group), linewidth = 0.3 ,position = position_dodge(width = 0.7), show.legend = FALSE) +
  scale_linetype_manual(values = c("Day" = "dashed", "Night" = "dotted"))+
  geom_point(aes(shape=group),size=2, position = position_dodge(width = 0.7),  show.legend = FALSE) +
  scale_shape_manual(values = c("Day" = 20, "Night" = 4))+
  scale_y_continuous(breaks = seq(0, 1,0.1), limits=c(0,1),expand=c(0.05,0)) +
  scale_x_discrete(expand=c(0,0))+
  labs(x = 'Experimental sequence',y= 'Habitat occupancy')+
  theme_JN()+
  theme(axis.text.x=element_text(size=8),
        strip.background = element_rect(fill = "grey90"),
        panel.spacing.x =unit(0, "lines") ) + 
  facet_grid(~ habitat, scales = "fixed")+
  geom_text(data = modeloutput_df %>% filter(present == "F"), aes(x = x, y = 0.5, label = "Unavailable"), size = 8/.pt, angle = 90, fontface = "italic")


Fig 1. Habitat occupancy fitted by Generalised Linear Mixed Models (log-linked Gaussian distributions) across each response variable a) pumping station, b) open water, c) artificial habitat. All models incorporated the random effect of time, day, and trial. Error bars represent 95% confidence intervals. Values fitted by light period, day = dots and dashed lines, night = crosses and dotted lines. Light grey background shading denoted by ‘unavailable’ indicates where habitat was not available in the experimental sequence. Models fitted using glmmTMB in R 4.3.1.


ggplot(data=c_hab_glmmm_treat, aes(x=x, y=predicted, fill=group))+
  geom_tile(aes(x=x, y=0.5,height = Inf, fill=present), alpha = 0.3,  show.legend = FALSE) + 
  scale_fill_manual(values = c("T" = NA, "F" = "grey90"), guide = FALSE) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), size=0.5,width = 0.6 ,position = position_dodge(width = 0.7),  show.legend = FALSE) +
  geom_point(aes(shape=group),size=1.5, position = position_dodge(width = 0.7),  show.legend = FALSE) +
  scale_shape_manual(values = c("Covered (B)" = 20, "Uncovered (A)" = 4))+
  scale_y_continuous(breaks = seq(0, 1,0.1), limits=c(0,1),expand=c(0.05,0)) +
  scale_x_discrete(expand=c(0,0))+
  labs(x = 'Experimental sequence',y= 'Daytime artifical habitat occupancy')+
  coord_cartesian(clip="off")+
  theme_JN()+
  theme(axis.text.x=element_text(size=8)) + 
  geom_text(data = c_hab_glmmm_treat %>% filter(present == "F"), aes(x = x, y = 0.5, label = "Unavailable"), size = 8/.pt, angle = 90, fontface = "italic")


Fig 2. Daytime artificial habitat occupancy fitted by Generalised Linear Mixed Model (log-linked Gaussian distribution). The model incorporated the random effect of time, day, and trial. Error bars represent 95% confidence intervals. Values fitted by habitat treatment, unsheltered (A) = crosses, sheltered (B) = dots. Light grey background shading denoted by ‘unavailable’ indicates where habitat was not available in the experimental sequence. Model fitted using glmmTMB in R 4.3.1.


ggplot(data=habitat_prob_df, aes(x = factor(x, labels = c("Pre-exclusion", "Post-exclusion")), y=predicted, fill=group))+
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width=.2) +
  geom_line(aes(group=group,linetype = group),  show.legend = FALSE)+
  geom_point(aes(shape=group),size=2,  show.legend = FALSE) +
  scale_shape_manual(values = c("1" = 20, "2" = 4))+
  scale_linetype_manual(values = c("1" = "dashed", "2" = "dotted"))+
  scale_y_continuous(breaks = seq(0, 1, 0.1), limits = c(0, 1), expand = c(0.05, 0), 
                     labels = scales::percent(seq(0, 1, 0.1), scale = 100)) +
  scale_x_discrete()+
  labs(x = 'Pumping station exclusion',y=expression(atop(NA, atop(textstyle('Predicted probabaility of'), textstyle('daytime habitat occupancy')))))+
  coord_cartesian(clip="off")+
  theme_JN()+
  theme(axis.text.x=element_text(size=8))


Fig 3. The predicted probability of fish occupying a habitat during daytime, fitted by Generalised Linear Mixed Models (logit-linked binomial distributions). The model incorporated the random effect of time, day, and trial. Error bars represent 95% confidence intervals. Values fitted by habitat type, artificial habitat = dots and dashed lines, pumping station = crosses and dotted lines. Models fitted using glmmTMB in R 4.3.1.



5. Post-hoc analysis

5.1 ANOVA

Post-hoc analysis can now be performed to support the model outputs. Tests of relevance here include paired t.tests and repeated measures ANOVA.

#AH
anov1<- aov(c_hab_normalized ~ sequence + Error(trial), data = roach_wide%>%filter(sequence!="Baseline"))
summary(anov1)
## 
## Error: trial
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## sequence   1  43.70   43.70   9.683 0.00508 **
## Residuals 22  99.29    4.51                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value Pr(>F)    
## sequence     2  171.6   85.78   934.6 <2e-16 ***
## Residuals 5140  471.8    0.09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#AH
anov2<- aov(c_ps_normalized ~ sequence + Error(trial), data = roach_wide%>%filter(sequence!="I 2"))
summary(anov2)
## 
## Error: trial
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## sequence   1  126.0  125.97   13.15 0.00149 **
## Residuals 22  210.8    9.58                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value Pr(>F)    
## sequence     2  380.5  190.24    2446 <2e-16 ***
## Residuals 5188  403.5    0.08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#OW
anov3<- aov(c_open_normalized ~ sequence + Error(trial), data = roach_wide)
summary(anov3)
## 
## Error: trial
##           Df Sum Sq Mean Sq F value Pr(>F)  
## sequence   1  15.24  15.239   3.013 0.0966 .
## Residuals 22 111.27   5.058                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value Pr(>F)    
## sequence     3   49.5  16.487   157.5 <2e-16 ***
## Residuals 6915  723.8   0.105                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the repeated measures ANOVAs suggest thee is a significant difference in habitat occupancy across experimental sequence.

5.2 t.test

For the paired t.tests it will be necessary to create new dataframes with equal samples between grouping variables.
First identify the group lengths.

table(roach_wide %>%
        filter(light == "Day") %>%
        .$sequence)
## 
## Baseline      I 1      I 2      I 3 
##      690      648      648      630

We will use the minimal group length (630) for group comparisons.

Paired comparison between baseline and intervention 1 pumping station.

i1_base_ps<-  roach_wide %>%
  filter(light == "Day" & sequence %in% c("Baseline", "I 1")) %>%
  group_by(sequence) %>%
  mutate(row_num = row_number()) %>%
  filter(row_num <= 630) %>%
  ungroup() %>%
  select(-row_num)

t.test(c_ps_normalized ~ sequence,data=i1_base_ps, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_ps_normalized by sequence
## t = 7.2579, df = 629, p-value = 1.164e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.07747805 0.13495581
## sample estimates:
## mean difference 
##       0.1062169

Paired comparison between intervention1 and intervention 2 in artificial habitat.

i1_i2_hab<- roach_wide %>%
  filter(light == "Day" & sequence %in% c("I 1", "I 2")) %>%
  group_by(sequence) %>%
  mutate(row_num = row_number()) %>%
  filter(row_num <= 630) %>%
  ungroup() %>%
  select(-row_num)

t.test(c_hab_normalized ~ sequence,data=i1_i2_hab, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_hab_normalized by sequence
## t = -37.156, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.6448020 -0.5800657
## sample estimates:
## mean difference 
##      -0.6124339

Paired comparison between intervention 1 and intervention 3 in pumping station.

i1_i3_ps <- roach_wide %>%
  filter(sequence %in% c("I 1", "I 3") & light == "Day") %>%
  group_by(sequence) %>%
  filter(!(sequence == "I 1" & row_number() > 630)) %>%
  ungroup()

t.test(c_ps_normalized ~ sequence,data=i1_i3_ps, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_ps_normalized by sequence
## t = 42.112, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.6350747 0.6972004
## sample estimates:
## mean difference 
##       0.6661376

Paired comparison between intervention 2 and intervention 3 in artificial habitat.

i2_i3_hab<- roach_wide %>%
  filter(sequence %in% c("I 2", "I 3") & light == "Day") %>%
  group_by(sequence) %>%
  filter(!(sequence == "I 2" & row_number() > 630)) %>%
  ungroup()

t.test(c_hab_normalized ~ sequence,data=i2_i3_hab, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_hab_normalized by sequence
## t = -0.51646, df = 629, p-value = 0.6057
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.03366707  0.01964590
## sample estimates:
## mean difference 
##    -0.007010582

Paired comparison between intervention 1 and intervention 3 in artificial habit.

i1_i3_hab<- roach_wide %>%
  filter(sequence %in% c("I 1", "I 3") & light == "Day") %>%
  group_by(sequence) %>%
  filter(!(sequence == "I 1" & row_number() > 630)) %>%
  ungroup()

t.test(c_hab_normalized ~ sequence,data=i1_i3_hab, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_hab_normalized by sequence
## t = -38.233, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.6512604 -0.5876285
## sample estimates:
## mean difference 
##      -0.6194444

Paired comparison between habitat treatments in artificial habitat.

treatment_hab<- roach_wide %>%
  filter(sequence %in% c("I 1","I 2", "I 3") & light == "Day") %>%
  group_by(sequence) %>%
  filter(!(sequence %in% c("I 1", "I 2") & row_number() > 630)) %>%
  ungroup()

t.test(c_hab_normalized ~ treatment,data=treatment_hab, paired = TRUE)
## 
##  Paired t-test
## 
## data:  c_hab_normalized by treatment
## t = 4.2772, df = 944, p-value = 2.085e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.0273454 0.0737128
## sample estimates:
## mean difference 
##       0.0505291

Paired comparison between treatments in pumping station.

treatment_ps<- roach_wide %>%
  filter(sequence %in% c("Baseline", "I 1", "I 3") & light == "Day") %>%
  group_by(sequence) %>%
  filter(!(sequence %in% c("Baseline", "I 1") & row_number() > 630)) %>%
  ungroup()

t.test(c_ps_normalized ~ treatment,data=treatment_ps, paired = TRUE) 
## 
##  Paired t-test
## 
## data:  c_ps_normalized by treatment
## t = -4.3869, df = 944, p-value = 1.279e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.07606882 -0.02904582
## sample estimates:
## mean difference 
##     -0.05255732

6. Write-up

6.1 Methodology

During the video analysis 20,827 individual fish counts were made (i.e., a count of 0 – 12 h-1 for each habitat) and therefore we used Generalized Linear Mixed Effect models (GLMMs) to account for repeated measures (trial) and temporal dependency (hour, day) and to investigate habitat occupancy and preference under baseline, habitat introduction and habitat exclusion conditions. To standardise counts and generalise interpretation, the discrete fish count data (0 – 12) from the pumping station and artificial habitats, and open water were rescaled to the continuous variable ‘habitat occupancy’ (0 – 1) by applying min-max scaling (R function ‘scale’) prior to analysis. Individual GLMMs (R function ‘glmmTMB’ in package ‘glmmTMB’) were built to estimate habitat occupancy in the three response categories (pumping station, open water, artificial habitat). The rescaled response variables were left-skewed (i.e., many values close to 0) and thus a Gamma distribution was not used. Instead, zeros were removed from the rescaled response variables by adding 1 × 10−9 and gaussian distributions were used in the models with a log link applied to account for heteroscedasticity. We defined the model structure a priori, including experimental sequence (levels = baseline, I 1, I2, I3), light (levels = day, night), and treatment (levels = unsheltered (A), sheltered (B)) as fixed effects. The baseline and intervention 2 categories were not included in the artificial habitat and pumping station models, respectively, as these were inherently not measured according to the experimental design. Repeated measures and temporal dependency were treated by including the nested random effects of hour, day, and trial, which all had variances higher than zero, improved the models Akaike’s Information Criterion (AIC) and goodness of fit (Table 1). To separately analyse the effect of habitat exclusion on overall probability of fish occupying a given habitat post-exclusion (sequence: I 3) we fitted a pair of GLMMs in glmmTMB with binomial distribution (link = logit). In these models, fixed effects were limited to experimental sequence (I 1, I 3) and used the same random effects specified in the main models. Model assessment was performed on the final models by examining the predicted versus residual diagnostic plots according to Zurr.et al. (2007) (R function ‘simulateResiduals’ in package ‘DHARma’). Residuals of the artificial habitat model were normal, but the pumping station and open water variables deviated. However, sample sizes of the response variables were sufficiently large and thus paired t tests (R function ‘t.test’) and repeated measures ANOVA (R function ‘aov’) were used for post-hoc group comparisons when there were significant fixed effect differences in the models (Fagerland, 2012). Model predictions (means ± 95% CI) were calculated using R function ‘ggpredict’ in package ‘ggeffects’. All data were analysed using R version 4.3.1 (RCore Team, 2022) in RStudio 2023.06.0 (RStudio Team, 2022) and statistical figures were created using R packages ‘ggplot2’ and ‘ggpubr’.

6.2 Results

The raw data included a total of 30120, 29004, and 23924 fish counted across the 24 12-day trials in the artificial and pumping station habitats, and open water, respectively. After rescaling the response variables to represent habitat occupancy (0 – 1) we found that habitat occupancy was highest during the day, and night had a negative effect on habitat occupancy in the pumping station (GLMM: -0.253 ± 0.019, p = < 0.001) and artificial habitats (GLMM: -0.621 ± 0.025, p = < 0.001); open water occupancy was lowest during the day and increased at night (GLMM: 1.388 ± 0.046, p = < 0.001) (Figure 3, Table 2). The nocturnal effect i.e., fish occupying structured habitat during the day and dispersing in open water at night was fixed throughout the study, except where open water dispersal appeared to be reduced when we introduced artificial habitat (Figure 3b). No significant day-to-day variations in this relationship were observed within each experimental sequence (i.e., daytime occupancy was similar 24h vs 72h post-intervention). Our results show that habitat availability (i.e., experimental sequence) had a significant effect on habitat occupancy in the pumping station (repeated measures ANOVA: df = 2 F = 2446 p = <0.001), artificial habitat (repeated measures ANOVA: df = 2 F = 934.6 p = <0.001) and open water (repeated measures ANOVA: df = 3 F = 157.5 p = <0.001) models. Our hypothesis (i) that fish will occupy pumping station habitat during the day was supported during a baseline observation during which pumping station occupancy was predicted to reach 0.764 ± 0.019 (GLMM: p = < 0.001) (Figure 3a, Table 2). We introduced artificial habitat in a pre-exclusion period (intervention 1), but a predicted occupancy of 0.304 ± 0.038 (GLMM: p = < 0.001) in the artificial habitat only partially supported our hypothesis (ii) and suggested fish preferred the pumping station in which occupancy was marginally reduced to 0.697 ± 0.019 compared to baseline measurements (GLMM: p = < 0.001) (paired t.test: t = 4.846, p = <0.001) (Figure 3, Table 2). In support of our hypothesis (iii) artificial habitat occupancy was positively correlated with the exclusion period when we removed fish from the pumping station (intervention 2), during which daytime artificial habitat occupancy significantly increased to 0.933 ± 0.22 (GLMM: p = < 0.001) (paired t.test: t = -32.673, p = <0.001) (Figure 3c, Table 2). We reintroduced the pumping station in a post-exclusion period (intervention 3) to determine whether habitat exclusion affected the preference for pumping station demonstrated during baseline and pre-exclusion (intervention 1) observations. Our results supported our hypothesis (iv) and show that daytime habitat occupancy in the pumping station was significantly reduced to 0.098 ± 0.086 (GLMM: p = < 0.001) when compared to the pre-exclusion period (paired t.test: t = 42.112, p = <0.001) (Figure 3a, Table 2). In contrast, artificial habitat occupancy was significantly increased to 0.812 ± 0.024 (GLMM: p = < 0.001) when compared to pre-exclusion (paired t.test: t = -38.233, p = < 0.001) and only marginally reduced when compared to the exclusion period (paired t.test: t = - 0.516, p = 0.605) (Figure 3c, Table 2). In support of our hypothesis (v), we found significant effects of artificial habitat treatment where artificial habitat occupancy was negatively correlated with unsheltered (A) treatments (GLMM: -0.085 ± 0.015, p = < 0.001) and was significantly higher during trials which received sheltered treatments (paired t.test: t = 4.277, p = < 0.001) (Figure 4; Table 2). Accordingly, pumping station occupancy was positively correlated with unsheltered (A) treatments (GLMM: 0.042 ± 0.019, p = 0.026) and was significantly higher during trials which received unsheltered treatments (paired t.test: t = - 4.386, p = < 0.001) (Table 2). Open water occupancy was unaffected by treatment (GLMM: 0.039 ± 0.024, p = 0.098) (Table 2). Overall, our results show that daytime artificial habitat occupancy increased 2.7x by the end of the study, reaching 0.745 ± 0.024 and 0.812 ± 0.024 in the unsheltered and sheltered treatments, respectively. Based on our results, habitat exclusion was considered a primary driver for whether fish would preferentially select a pumping station or artificial habitat. Accordingly, we performed a pair of binomial GLMMs to examine the isolated effect of exclusion on habitat occupancy probability. These models add further support for our hypothesis (iv) and show the probability of fish occupying the pumping station was negatively correlated with the post-exclusion period (GLMM: -3.385 ± 0.166, p = < 0.001) during which occupancy probability significantly decreased from 74.1 ± 8.97 % in the pre-exclusion period to 8.84 ± 14.02 % in the post-exclusion period (Figure 5, Table 2). In contrast, we found that the probability of fish occupying artificial habitat was positively correlated with the post-exclusion period (GLMM: 2.909 ± 0.164, p = < 0.001) and occupancy probability significantly increased from 24.7 ± 10.3 % in the pre-exclusion period to 85.7 ± 12.6 % in the post-exclusion period (Figure 5, Table 2).